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Classical vs. Bayesian methods for linear system identification 
point estimators and confidence sets 
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Abstract — This paper compares classical parametric methods 
with recently developed Bayesian methods for system identifi¬ 
cation. A Full Bayes solution is considered together with one 
of the standard approximations based on the Empirical Bayes 
paradigm. Results regarding point estimators for the impulse 
response as well as for confidence regions are reported. 

I. Introduction 

Linear system identification is sometimes considered to 
be a mature field, see e.g. [1], [2], In particular parametric 
prediction error methods (PEM) are by now well developed 
and understood. Yet, facing in an effective manner the so- 
called bias variance dilemma hading model complexity vs. 
data fit is still an open issue and, very recently, regularization 
methods for system identification [3], [4], [5] have been 
revitalized; see e.g. [6], [7], [8], [9]. 

In particular experimental evidence has shown that para¬ 
metric methods may give rather unreliable results when 
model complexity is not fixed but has rather to be determined 
from data. Since most criteria for determining complexity 
are derived using asymptotic arguments, this is yet another 
symptom suggesting that asymptotic theory is not to be 
blindly trusted. This is not only related to issues pertaining 
to local minima (as discussed for instance in [10]), but also 
to the fact that it is difficult to say “how much data is 
enough data” to be in the asymptotic regime. These issues 
concerning asymptotic results become even more dramatic 
when parameter estimation has to be coupled with model 
selection, resulting in so called Post Model Selection Estima¬ 
tors (PMSE). [11] have pointed out that asymptotic analysis 
is rather delicate in this case. 

Therefore, if under certain circumstances asymptotic anal¬ 
ysis fails in delivering reliable indications as to the variability 
of an estimator, how would one go about providing, e.g., 
confidence sets for estimated systems? This is certainly of 
primary importance in a system identification exercise as 
one is not only interested in providing estimators for some 
quantity of interest, but also in providing quality tags which 
measure how reliable an estimator is. 

In this paper we shall compare Bayesian methods in a 
noil-parametric setting to classical parametric approaches. 
In particular, we will compare the uncertainty sets which 
can be found following the classical parametric paradigm, 
specifically PEM equipped with BIC criterion and with an 
oracle to estimate model complexity, with the non-parametric 
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approaches, both following the Full Bayes as well as the 
Empirical Bayes methods. 

The paper is organized as follows: Section [II] introduces 
the system identification problem, while Sections III and 


IV respectively illustrate the classical parametric methods 


and the Bayesian non-parametric approach adopted in a 
system identification setting; both the point estimators and 
the confidence sets arising from these two approaches are 
presented. Section[V]provides an experimental comparison of 
these techniques, while Section [VI] draws some final remarks 
on the observed results. 


II. Problem Formulation 

Consider, for the sake of the exposition, a single-input- 
single-output Output Error model: 

y(t) = [h*u](t) + e(t) (1) 

where y(t),u(t) £ R are respectively the measurable input 
and output, e(t) is a zero mean Gaussian white noise 
uncorrelated to u{t) and h(t) is the impulse response of the 
model. 

Given a finite set of input-output data points 
S' = {u[t), ;y(f)}re{i,...,r}, system identification aims at 
estimating the impulse response h(t). Moreover, one could 
also be interested in determining a (random) set which 
is likely to include the unknown true h{t): this range is 
generally referred to as confidence set. In this paper we will 
compare the classical and the Bayesian methods for system 
identification on both these two aspects of the problem. 

In the remaining of the paper, we shall consider {«(?)} 
and {_v(f)} as jointly stationary zero-mean stochastic 
processes and denote with U, Y £ K: 7 ' the vectors with 
entries u(t), y(t), t = 1 respectively. 


III. Classical Identification Methods 
A. Point estimator 

Within the classical parametric identification framework, 
one assumes that the system to be identified belongs to a 
specific model class ..<# (e.g. ARMAX, OE, Box-Jenkins, 
state-space, etc.), which is parametrized through a parameter 
0 £ 0, i.e. .///(0). The commonly used PEM (Prediction 
Error Method) determines the estimate of 0 by minimizing 
the sum of squared prediction errors, i.e.: 

1 T 

§pem =argminJ(0) = argmin - £(y(f) —y(f |0)) 2 (2) 
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where y(f|0) denotes the one-step ahead predictor of the cho¬ 
sen model class. Once 9pem has been determined, one can 
then compute the corresponding impulse response estimate 

^OpEM (O' 

Many interesting properties of these estimators are derived 
using asymptotic arguments, i.e. considering T —t For 
instance, for Gaussian innovations e(t) and for fixed model 
complexity, these methods have proved to be asymptoti¬ 
cally efficient. However, model complexity, which strongly 
affects their effectiveness, has to be estimated from the 
data. Different approaches are commonly exploited for this 
purpose, such as Cross-Validation or the Information Criteria 
(AIC/FPE, BIC/MDL, etc.) which are derived by asymptotic 
arguments. From these considerations a natural question 
arises: how many data have to be considered for these asymp¬ 
totic properties to be reliable in a finite-sample domain? 
The answer is not general and could be really application- 
dependent. 


B. Confidence Set 

1) Asymptotic: Consider the estimate 0; under the as¬ 
sumption that the true system belongs to the chosen model 
class ./// and some other mild assumptions, (e.g. Qpem 
gives rise to a uniformly stable model and the given data 
{y(f)}, {u(t)} are jointly quasi-stationary signals), it holds 
that 

9pem —f ffy), ~jr J j as T —> °° (3) 


where 9q is the unique value in © such that 

9pem ~> 9q, w.p. 1 as T —> « 

and 


(4) 
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See [12] for more details. 

Once 9pem has been determined exploiting the given T 
input-output pairs, the asymptotic covariance 0 can be 
approximated as 


Eg =7(0r) s — j^E Y(t,9pEM)Y{t,9pEMV 


t =t 


Vi *> ®pem) = ~^qS{AQ)\b=^ p . 


•(7) 

( 8 ) 


Notice that, in case of Gaussian innovations Eg coincides 
with the Cramer-Rao lower bound, thus proving the afore¬ 
mentioned asymptotic efficiency of the PEM estimators. 

Observe that the asymptotic covariance 0 describes the 
(asymptotic) confidence set in the space of the estimated 
parameters 9. For further comparison with the Bayesian 
methods, we are also interested in determining a confidence 
set for the estimated impulse response coefficients hg pEM . To 


do this, one could proceed analytically by linearizing the 
map: 

Jz?: © -» R" (9) 

9 h-> h 

and thus directly mapping the parameter confidence set onto 
the space of impulse response coefficients. Notice that, for 
simplicity, we consider a truncated impulse response, where 
the length n can be chosen in order to account only for the 
relevant part of the impulse response. 

In order to avoid the linear approximation introduced by 
the mentioned approach, we prefer to resort to Monte-Carlo 
sampling which yields a point distribution of the confidence 
set in the impulse response space. We first draw N samples 
00) from the distribution pj(- ) ~ {9pem, y ); for each 

of them we build the model .#(00)) and we compute its 
impulse response /z 0 (,) of length n. We then determine the 
confidence set composed by the h„a) associated with the a- 
fraction of the highest probability pr(-), i.e: 

S PEM+ASYMP = C . pr(fl (0) > p PPM + ASYMP 0 (,) g @ j 

( 10 ) 

where p^ EM + A SYMP j s jq _ qj j_p ercen tile of the set 

2) Likelihood Sampling : As an alternative, instead of 
relying on the approximation ([7]i to the asymptotic covari¬ 
ance 0, one could define a confidence set sampling from 
the likelihood function p(y|0,<7 2 ), with a 2 being a noise 
variance estimate (obtained e.g. through a Least-Squares 
model). In fact, assuming a flat prior distribution p{9) for 
the parameters, the likelihood function is proportional to the 
posterior distribution: 

p (01T, d 2 ) - p (T10, d 2 ) = (2?r ct 2 ) exp | ^7(0) | 

(11) 

where 7(0) has been defined in (|2j. Hence, we design an 
MCMC algorithm to obtain N samples 00) from ( fTT) . From 
these we compute the corresponding impulse responses hn{i) 
and we define the set 

S PEM+LIK = | AflW . p{e [t) |y ; &2) > p PEM+UK ) qH) g @ j 

( 12 ) 

where p^ M+LIK is the (1 — a)-percentile of the set 

[p(00)|y,d 2 )}, i- = [i,tv]. 

As previously said, sampling techniques allow to avoid 
approximations of asymptotic expressions. However, they are 
still approximations of the true uncertainty associated to the 
estimated parameter 9pem ■ Indeed, for the definition of the 
previous confidence sets, it has been assumed that the model 
class ./// and the model complexity are fixed, even if in prac¬ 
tice model selection is performed using the available data. 
That is, 9pem is a so-called post-model-selection estimator 
(PMSE): in order to define a more accurate confidence set, 
we should take into account also the uncertainty related to 
the model selection step. However, as emphasized in [11], the 
finite-sample distribution of a PMSE generally has a quite 






intricate shape; moreover, even if one tries to estimate it 
through a sampling method, one has to recall that the finite- 
sample distribution of a PMSE is not uniformly close to its 
asymptotic limit < 0 . 

IV. Bayesian Identification Methods 
A. Point estimator 

Non-parametric approaches to the system identification 
problem follow the Bayesian framework: one postulates that 
the impulse response to be estimated, h(t), is itself a random 
process and one seeks for its posterior distribution given the 
data, p(h\Y). 

The a priori probability distribution given to h(t) is called 
prior 

h~p(h\n) (13) 

and in general depends upon some unknown parameters 77 , 
called hyperparameters hereafter, which need to be estimated 
from data. 

A common and convenient choice is to model h(t) as a zero 
mean Gaussian process, independent of the noise e(t) with 
covariance function K(t,s), i.e. 

E/z(f) = 0 

E/r(f)/z(s) = Kq(t : s) 

The covariance function Kq(t,s) is sometimes called kernel 
in the Machine Learning community. This type of Gaussian 
priors can be derived following Maximum Entropy argu¬ 
ments, see e.g. [13], [14]. 

The minimum variance estimate of the impulse response 
is then given by: 

h = E[h\Y] = J hp(h\Y)dh 

= j J hp{h\T],Y)p{Tl\Y)dhdT] (14) 

= J E[h\Y 1 T}]p{ri\Y)dri 

where E[/?|T, 77 ] is the conditional estimate of h when 77 
are fixed. In a general framework these integrals are not 
analytically tractable and it is necessary to resort to effective 
approximations, e.g. analytical approximations or Markov 
Chain Monte Carlo (MCMC) methods. These approxima¬ 
tions yield to different approaches, such the so-called Em¬ 
pirical Bayes (EB) and Full Bayes (FB) estimators. 

Remark 1: In principle, the estimator G3 belongs to 
an infinite-dimensional space. However, for computational 
reasons, it is general practice to estimate a finite-length 
impulse response, whose length n is chosen large enough to 
capture the dynamics of the estimated system. In this case, 
h S R” is modelled as a zero-mean Gaussian random vector 
with covariance Kq eM" xn . 

1) Empirical Bayes: The Empirical Bayes approach is 
based on the assumption that the marginal posterior distribu¬ 
tion on the hyperparameters 77(77 |T) can be approximated 
by a delta-function centered at its mode 77 ; under this 
approximation the outer integral in (fl4|) is trivially equal to 


E[/z|T, 77 ] evaluated at fj. In order to estimate this value of 
77 the common approach is to consider a non informative 
prior on the hyperparameters and maximize the so-called 
marginal likelihood, p(Y\ri). Under the assumptions on the 
output noise and on the processes {y(f)} • {«(/)} ( see Section 
0 - this marginal density can be computed in closed form, 
as discussed in [ 6 ] and [15], and is given by 

p(Y\ri) = exp ^-iln(det[27rE v (rj)) - 

(15) 

Z y (ri) = <t>Kq<t> T + a 2 I (16) 

where a 2 := Var{e(t)} is the variance of the innovation 
process ([T]) and <t> g R rx " is a matrix built with past input 
data; see [16], [15] for details. 

It follows that we can compute the point estimate of the 
hyperparameters for the EB approach as 


T]eb = argmaxp(y|f 7 ) (17) 

1 


and we can finally obtain the EB estimator of h, Iieb = 
E[/z|T, t)eb\- Notice that, since h(t) and e{t) are Gaussian 
and independent, the convolution is a linear operation, then 
Y and hit) are jointly Gaussian yielding also h conditioned 
on Y be Gaussian for a fixed 77 : 


p{h\Y,ri) ~^(/ir(7]),zr(j])) (18) 

where 

VC ft) = nh\Y,rj] 

= Kq<t> T ($>Kr,<t> T + C7 2 /) Y (19) 

Tpto) = Kq-Kn<i> J Y y (Tir 1( S>Kq ( 20 ) 


Hence the posterior estimate Iieb can be computed in closed 
form using ( fl9| ). 

2) Full Bayes: The Full Bayes approach has the advan¬ 
tage that it does not assume any particular distribution form 
of the marginal posterior /? ( 771 F); therefore in principle, it 
generates a more accurate estimate than the EB estimator 
(under the assumption that the a priori Bayesian model is cor¬ 
rect). As a disadvantage, in general it requires a much higher 
computational effort which, when the marginal posterior 
77(77 |T) is sufficiently peaked, may not be counterbalanced 
by a significant performance increase. 

Here we consider a full Bayes estimator of the impulse 
response h obtained by an adaptive version of the Metropolis- 
Hastings algorithm (Adaptive Metropolis, AM hereafter); see 
[17], [18], 

Recall that the target is to compute the posterior distri¬ 
bution of the impulse response given the data which, as 
mentioned in Section IV-A cannot be computed analytically. 
For this reason, we tackle the problem by approximating the 
posterior as 


p(h\Y) = [ p(h\Y,r])p(ri\Y)dr] 

Jri 


i£/>(/7|F,rj«) (21) 

yv ;=i 



where p(h\Y, 77 W) is the posterior density <01 when the 
hyperparameters are fixed equal to 77 M. 

In order to do this, we need to design an MCMC algorithm 
to draw samples 77 W from p(ri\Y). Observe that: 

p{vlY)= pm*m xpmv) (22) 

where we have assumed that p(jj) is a non informative prior 
distribution. Thus, by using GD we can evaluate p(rf\Y) 
apart from the normalization constant p(Y). 

As mentioned earlier, we have exploited the AM algorithm 
proposed in [18] to obtain the samples 77 W. The basic 
idea which distinguishes the AM algorithm from a regular 
Metropolis-Hasting is to update the proposal distribution 
exploiting the new knowledge which becomes available: 
at each iteration z, the AM algorithm adopts a Gaussian 
proposal distribution centered at the previous sample 
and with a covariance matrix which is adaptively updated 
based on the samples The updating recursion 

formula for the covariance matrix given in [18] is: 

^?+t = — *%+ s 4 (if, {i) f, (i)T +77 (i V )T +el d ) (23) 

1 1 

where )]/ is the mean after k samples, s,i is a regularization 
parameter, Ij is the identity matrix of dimension d, which 
is the dimension of the hyperparamters, and e > 0 is an 
arbitrarily small constant. The value of the regularization 
parameter initially has been chosen to be sj = a value 
which gives good mixing properties in the Metropolis chain 
under the assumption of Gaussian targets and proposal, as 
shown in [19], then it has been empirically adjusted in order 
to have an acceptance rate of the MCMC algorithm around 
the 30%. 

The algorithm we implemented in order to obtain the FB 
estimate li/ n is briefly outlined in the following. 


Algorithm 1: 

Sample hyperparameters through an AM algorithm 

1) Initialize the proposal density qfi-) for the AM algo¬ 
rithm: set qo(-) = J#o), with 




dr \n[p{Y\f]i iB )p{f]EB)] 
drjdri 7 


2) For i > 0 Iterate: 

. Sample 77 from <?,( jT)( !_I )) ~ rV Mf)) 

• Sample u from a uniform distribution on [0,1] 

• Set 


77 « = 


1 


(<-l) 


if < P(Y\ri)p(ri) 

otherwise 


Compute M\e\ according to equation 


3) After a (sufficiently long) burn-in period, keep the 
last A samples 77 W which are (approximately) samples 
from p(r] |T). 


Estimate the impulse response: 


4) For i = 1 to A do 


• Compute Mr(77«),Er(77 W ) as in < P1 ( [20| - 
. Sample from ^(p F ° st (TJ (i) ),Zf *( 77 $)) 

5) The samples lv l! obtained above are samples from 
p(h\Y). The Minimum Variance estimate of h is finally 
computed as: 

1 N 

^« = ( 24 > 

i=\ 

B. Confidence Set 

Within the Bayesian framework, the confidence of the final 
estimator is described by the posterior density p(li\Y). Since 
the Empirical Bayes (EB) and the Full Bayes (FB) estimators 
lead to different approximations of p(h\Y), they will also 
lead to different definitions of the confidence set, as will be 
illustrated in the following. 

1) Empirical Bayes: When the Emprical Bayes approach 
is considered, the posterior p(h\Y) is the Gaussian distri¬ 
bution defined in ( [T 8 | with 77 fixed to t)eb- Hence, one can 
define the following ellipsoidal confidence region in R", with 
n being the length of the estimated impulse response, i.e. 
Iieb € M ”: 

g™ = [x G P : (x — h EB ) T ^ l EB (x -h EB )< xl («)} (25) 

For a fixed probability level a , Xa( n ) is the value for which 
Pr (X 2 («) < Xa( n )) = a - $a B defines the region in which a 
sample from p{h\Y) will end up with probability a. Note, 
for future use, that this set corresponds also to the set of 
“size” (= probability) a which satisfies: 

p{h s \Y)>p(h S c\Y) V h s &g EB (a) hgc ^ g EB (a) 

(26) 

To have a confidence set comparable to the ones defined for 
the classical methods, we approximate the set ( |25| by a point 
distribution obtained by sampling the posterior distribution 
p(h\Y, Beb) and retaining only the samples which belong to 
( |25| i, that is: 

S EB = j/z® £ R" : h (i) G g EB |, (27) 

2 ) Full Bayes: The FB estimator we previously described 
exploits the sample approximation to the posterior distribu¬ 
tion in ©■ Due to the non-Gaussianity of this approximated 
distribution, we can not define an ellipsoidal confidence 
region. However, an appropriate a-level confidence set is 
given by: 

S F a B = j/z« eR":i f p(h®\Y,rij) > p FB |, (28) 

where p FF is the (1 — a)-percentile of the set 

i = 1,...,A 

yv 7=1 

That is, S'ffi contains the impulse response samples li iJI 
associated with the a-fraction of the highest values of the 
approximated posterior (| 2 Tj). 









V. Simulations 

The performance of the described system identification 
approaches, EB, FB and PEM, are evaluated by using a 
Monte Carlo study over 100 datasets. At each run a model 
such as {[} is estimated together with a confidence set around 
the estimated impulse response. The performance of the 
estimators are compared both in terms of impulse response 
fit as well as of the accuracy of the corresponding confidence 
set, determined as illustrated in Sections [M] and [TV] 

A. Data 

The data-bank of system and input-output data used in our 
experiments have been already used and introduced in [20]. 
In particular, we applied the identification techniques to the 
data sets “D4” which is briefly described in the following. 
The data set consists of 30th order random SISO dicrete-time 
systems having all the poles inside a circle of radius 0.95. 
These systems were simulated with a unit variance band- 
limited Gaussian signal with normalized band [0,0.8]. A zero 
mean white Gaussian noise, with variance adjusted so that 
the Signal to Noise Ration (SNR) is always equal to 1, was 
then added to the output data. The number of input-output 
data pairs is 500. 

In addition, we experimented the dataset “S1D2” introduced 
in [16]. The results were similar to the ones obtained on 
dataset “D4” and outlined in the following; therefore, we 
are not going to report them here. 

B. Estimators 

1) PEM: In the simulations we performed, the chosen 
model class for the PEM methods is OE. Model selection 
has been performed through BIC criterion, since it 
generally outperforms AIC. We will denote this estimator 
as PEM+BIC. 

Moreover, as a reference we also consider an oracle 
estimator, denoted by PEM+OR, which has the (unrealistic) 
knowledge of the impulse response of the true system, h : 
among the OE models with complexity ranging from 2 to 
30, it selects the one which gives the best fit to h. 


For ease of notation, we will now use the apex X to denote 
a generic estimator among the ones previously illustrated, 
that is, PEM+BIC, PEM+OR, EB and FB. 

C. Impulse Response Fit 

As a first comparison, we would like to evaluate the 
ability of the considered identification techniques on the 
reconstruction of the true impulse response. Thus, for each 
estimated system and for each estimator X we compute the 
so-called impulse response fit: 

^<S) = 100 x] 1(30) 

where h. h are the true and the estimated impulse responses 
of the considered system. 
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Fig. 1: Monte Carlo results. Boxplots of the impulse response 
fit for the compared identification techniques. 


Impulse Response Fit 



PEM+OR PEM+BIC 


Figure |T| displays the boxplots of index ( [30| for the 4 
estimators and the resulting average can be seen in Tabel [T] 


2) EB, FB: For the Bayesian estimators illustrated in 
Section IV the choice of the prior distribution on the 
impulse response to be estimated is a crucial point for the 
identification problem. 

The experiments we present in this Section have been ob¬ 
tained adopting a zero-mean Gaussian prior with a covariance 
matrix (kernel) given by the so-called “DC”-kernel: 


K° c {kJ)=cp\ k -j\l( k +i)/ 2 


(29) 


where c > 0, 0 < A < 1 and |p| < 1 are the hyperparameters 
which form the set T] = {c,p,/ 1}. For further details on the 
meaning of these hyperparameters and on the properties they 
induce in the estimated impulse response we refer to [16], 
where the DC kernel has been proposed. 

The length n of the estimated impulse responses has been 
set to 100. 


PEM+OR 

PEM+BIC 

EB 

FB 

Fit Mean 79.1547 

60.6676 

78.4420 

78.1332 


TABLE I: Comparison of average impulse response fit. 


The oracle estimator PEM+OR sets an upper bound on 
the achievable performance by a parametric methods; we 
can note that EB performs remarkably well, with only a 
slightly inferior fit. The FB estimator performs similarly 
to EB, but it requires the implementation of a MCMC, 
which is highly computationally expensive. These results 
suggest that the marginal posterior p(i) |T) is sufficiently well 
peaked to be approximated by a delta function (meaning that 
p(h\Y) ~ p(h\Y. f] EB )). The PEM+BIC estimator has weaker 
performances: a lower median and a long tail of systems 
with low fit are obtained. This is most likely due to the 




















low pass characteristics of the input signal, which make 
the order estimation step particularly delicate. Indeed, in the 
dataset “S1D2” where the inputs were Gaussian white noises, 
PEM+BIC performed similar to the Bayesian estimators. 


D. Confidence Set Indexes 


tions 


The confidence sets which have been introduced in Sec- 

are . gPEM+OR+ASYMP gPEM+OR+LIK 


III-B 


and 


IV-B 


SPEMmtmSYMP^mt+BIC+UK' S EB and gFB Afj before ,^ 


will generic ally denote one of them. 

In the simulations we present, the previously defined con¬ 
fidence sets are made of N = 7200 samples and we set 
a = 0.95. These are only approximations of a “true” a-level 
confidence set and thus our aim is to study how well they 
perform both in term of “coverage” (how often does the a- 
level confidence set contain the “true” value?) as well as of 
size (how big is an a-level confidence set?). Unfortunately, 
since our sets are only defined through a set of points, it is 
not possible to define a notion of inclusion (does the true 
system belong to the confidence set?) and as a proxy to this 
we thus consider the following index which measures the 
relative distance from the true system and the closest point 
within the confidence set: 


1) Coverage Index: For a fixed probability level a, it is 
given by 

Jf[a) := min ^ ^ (31) 

xes% \\h\\ 


where h denotes the true impulse response. For future 
analysis the usage of the concept “coverage” will be 
meant as in definition ( |3T] >. 

As far as the “size” of the confidence sets we consider 
the index: 

2) Confidence Set Size: It evaluates the area of the interval 
which includes the whole slot of impulse responses 
contained in S Let us define the vectors h x € R" and 
h x £ R" whose y-entries are h x (j) :=max,/iW(y) and 
h x (j) := min,7i^(y), respectively, with ZtW g S the 
index we consider is defined as: 


i( a ) = ')-h x U) 

7=1 


(32) 


Referring to Figure [2j a large confidence set is more likely 
to contain the true impulse response, giving a low value of 
y x (a), but it will also denote a large amount of uncertainty 
in the returned estimate, thus leading to a large value of 

Figure [3] illustrates the boxplots for index ( |3 1 [ ). The con¬ 
fidence sets of the oracle perform well in terms of coverage, 
which is rather obvious because the estimator is selected by 
the oracle if its relative distance to the true system is small. 
EB and FB provide very similar performance in terms of 
coverage, outperforming the confidence sets computed from 
the parametric approach endowed with BIC. 

Figure [4] illustrates the boxplots for index ( |32[ i. The 
EB confidence set has a remarkably smaller size than 
the others. The size of the FB confidence set is slightly 


Illustration of Confidence Set Size 



Fig. 2: Illustration of the idea of the Confidence set size index 
for a single system. 


Coverage Index 



Fig. 3: Monte Carlo results. Boxplots of the Coverage Index 
for the compared identification techniques. 


larger than the EB one, which is rather obvious since also 
uncertainty related to the hyperparameters is accounted for. 
The parametric methods PEM+OR and PEM+BIC have 
larger confidence sets than the Bayesian ones. In particular, 
the two PEM+OR confidence sets are larger than the ones 
returned by the PEM+BIC estimator: this can be explained 
from the fact that PEM+OR tends to select higher-order 
models, thus bringing more uncertainty into the estimated 
systems. Comparing the Asymptotic and the Likelihood 
Sampling confidence sets it is clear that the latter is slightly 
more precise than the former. This is due to the fact that the 
Asymptotic confidence set is an approximation which holds 
for large data sets, while the Likelihood Sampling is correct 
for any finite sample size; however, this improvement comes 











































Confidence Set Size 



PEM+OR PEM+OR PEM+BIC PEM+BIC EB FB 

ASYMP UK ASYMP LIK 

Fig. 4: Monte Carlo results. Boxplots of the Confidence Set 
Size for the compared identification techniques. 


at a rather high computational price needed to run the 
MCMC sampler. It is important to note that the asymptotic 
theory does not take into account stability issues: namely, 
the confidence set derived from the Gaussian asymptotic 
distribution ([3]) could contain unstable impulse responses. 
Therefore the sampling procedure described in Section 
|III-B.l| could yield to diverging confidence set size. In order 
to avoid this problem we truncated the asymptotic Gaussian 
distribution within the stability region. Clearly, this fact 
shows an intrinsic problem of the asymptotic theory. We 
should also like to recall that the asymptotic as well as 
likelihood based confidence intervals do not account for 
uncertainty in the order estimation step. 


By comparing the results in both Figures 3]|4 


we can 

conclude that: among the feasible identification methods, 
EB and FB are preferable both in terms of coverage as 
well as size. In this case there seems to be no gain in 
using the much more computationally expensive FB. It seems 
also fair to say that the confidence sets attached to the 
parametric approaches, even those of the oracle estimators, 
are significantly worse than those obtained from the Bayesian 
methods. Not surprisingly, focusing on the parametric con¬ 
fidence sets, the ones obtained through sampling techniques 
are significantly smaller than the “asymptotic” counterparts, 
but at the cost of an extra MCMC algorithm. 


Remark 2: At this point one could argue that the sets 
S x t are only “sample” approximations of a confidence set, 
while one may be interested in having a bounded region as 
a confidence set. In the case of the EB estimator this region is 
directly defined since the posterior distribution is Gaussian, 
thus naturally leading to ellipsoidal confidence regions ( |25| . 
For all the other estimators, it is in principle possible to build 
outer approximations of the confidence sets e.g. building 
a minimum size set which includes all the points in S x t ; 


examples are the convex hull or an ellipsoid. 

The convex hull can be computed with off-the-shelf algo¬ 
rithms (such as the Matlab routine convhulln.m), while 
the smallest ellipsoid (in terms of sum of squared semi-axes 
length) can be found solving the following problem: 


Pa P ‘ ,ca* '■= arg min Trace P 

P,C 

P (h® - c) 
S ‘ tm [ (/,(/) _ c) T ! 

h [i) es x a 


^ 0 , 

(33) 


See [21] for further details. The corresponding ellipsoid is 
then given by 

C pt = {* £ r : {x-ct?{P°a Pt T\x-ct) < 1} (34) 

However, the computation of the convex hull as well as the 
solution of the optimization problem ( |33| ) become computa¬ 
tionally intractable for moderate ambient space and sample 
sizes. E.g. when the impulse response lives in K", n = 
100, the set S x a contains N = 7200 this computations are 
prohibitive with off-the-shelf methods. To overcome this 
issue, we tried to approximate the optimal ellipsoid 6"a P ’ by 
using the sample mean h s x and the sample covariance T. s x 
of the elements in S x ; namely: 

= { x e R" : (d X ) T ’E.-Jd X < k^ j , 

= x ~ h s x a (35) 

where k^ is a constant appropriately chosen so that all the 
elements of S x fall within fi x . However, it can be observed 
that these ellipsoids are rather rough approximations of the 
sets S\. E.g., inspecting 2D sections of the n-dimensional 
ellipsoids, it can be seen that often the axis orientation was 
not correct, thus leading to sets which are much larger than 
needed. This fact was mainly observed for the confidence 
sets related to PEM estimates. 

These observations suggest that the quality of the confidence 
sets obtained through the ellipsoidal approximation © 
would have been highly dependent on the quality of the fitted 
ellipsoid. Therefore, we concluded that a comparison among 
the different estimators, based on this kind of confidence set, 
would have led to unreliable results; therefore such results 
have not been reported. 


VI. Conclusions 

We have presented an in-depth comparison between para¬ 
metric and Bayesian methods for system identification. Our 
results complement previous findings showing that Bayesian 
methods not only outperform parametric methods in terms 
of point estimators, but also provide better approximations 
for uncertainty regions. From our limited experience there 
seems to be very little advantage in using Full Bayes 
approaches which entail a much higher computational load 
than Empirical Bayes methods. It is interesting to note that 
Bayesian estimators and their confidence sets are competitive 
even with the parametric methods equipped with an oracle 































which has the knowledge of the true impulse response. 
In addition, with regard to the parametric techniques, we 
showed that the confidence sets obtained through sampling 
techniques improve the ones returned by the “asymptotic” 
approximation. 
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